This project aims to explore data on housing prices in an effort to design a model for estimating pricing and other behaviors of the housing market, specifically within King County.
suppressPackageStartupMessages
(
{
#install or load required packages into the workspace
package_list <- c('tidyverse', 'knitr', 'plotly','RColorBrewer','sqldf')
non_installed <- package_list[!(package_list %in% installed.packages()[,"Package"])]
if(length(non_installed)) install.packages(non_installed)
#library('tidyverse')
library('plotly')
library('RColorBrewer')
library('sqldf')
library('knitr')
}
)
knitr::opts_chunk$set(message = F, warning = F, strip.white = F, tidy = T)
load in the required packages
#load dataset from project folder
suppressWarnings({housing_data <- read.csv("kc_house_data.csv", stringsAsFactors = F)})
#remove unused columns and normalize the data
housing_data_rev <- housing_data[, 3:21]
load in dataset and remove unused features
#collect a random sampling of the dataset ordered by price
train_vals <- sample(dim(housing_data)[1], 1000)
test_vals <- sample(dim(housing_data)[1], 200)
#full test set?
#test_vals <- !(1:dim(housing_data)[1] %in% train_vals)
housing_train_data <- as.data.frame(housing_data_rev[train_vals,])
housing_train_data <- housing_train_data[order(-housing_train_data$price),]
housing_test_data <- as.data.frame(housing_data_rev[test_vals,])
housing_test_data <- housing_test_data[order(-housing_test_data$price),]
sqldf("Select price As Pricing, grade As Rating, condition As Condition, sqft_lot As Sq_Footage
From housing_train_data
Limit 20")
This dataset is very large, so in order to get a better visual representation of the distribution of the data, a sample of 1000 randomly chosen rows is used instead of the whole dataset (ordered by price, as this is the dependent variable). The sample indexes are selected at runtime every time the script runs, so the data and analysis are different each time the script is run. Because the sampling is random and the dataset is homogenous and static, the 1000 rows chosen should always be representative of the nature of the whole dataset.
#Look at the distributions:
p1 <- plot_ly() %>% add_markers(name = "Pricing", x = 1:1000, y = housing_train_data$price) %>% layout(title = "King County Housing Feature Distributions")
p2 <- plot_ly() %>% add_markers(name = "Rating v Pricing", x = housing_train_data$grade, y = housing_train_data$price)
p3 <- plot_ly() %>% add_markers(name = "Condition v Pricing", x = housing_train_data$condition, y = housing_train_data$price)
p4 <- plot_ly() %>% add_markers(name = "Sq. Footage v Pricing", x = housing_train_data$sqft_lot, y = housing_train_data$price)
subplot(p1, p2, p3, p4, nrows = 2)
Looking at the distributions of prices by magnitude, it appears to follow a power law distribution that isn’t mirrored in the distributions of the other features seen here. Assuming that the features in this dataset capture most of the factors that influence the pricing of a home, some linear combination of the features could give a reasonable estimate for the pricing of an individual house. Since the distribution is closer to a power law, two seperate linear functions should give a pretty good estimate by accounting for the ‘knee’ in the pricing distribution. First though, we’ll try fitting a single line to the data and see what kind of error comes out.
#split data for training
#test general linear model on relevant quantitative data
housing_train_price <- housing_train_data[, "price"]
housing_test_price <- housing_test_data[, "price"]
glm_mod <- glm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + condition + grade + yr_built + yr_renovated, data = housing_train_data)
price_mod <- predict(glm_mod, newdata = housing_test_data)
price_est_plotly <- plot_ly() %>%
add_markers(name = "Test Set Price", x = 1:length(housing_test_price), y = housing_test_price) %>%
add_markers(name = "Linear Model est", x = 1:length(housing_test_price), y = price_mod) %>%
add_lines(name = "Magnitude Difference", x = 1:length(housing_test_price), y = price_mod - housing_test_price) %>%
layout(title = "Attempting Linear Regression Model </br>
for Pricing Homes in King County </br>")
price_est_plotly
Given that the distribution of prices was already known to follow a power law distribution, the result of this linear estimate isn’t very surprising. The linear fit is good for the long tail, but the whole is corrupted by trying to fit the collection of values above the elbow.
LS0tDQp0aXRsZTogIktpbmcgQ291bnR5IEhvdXNpbmcgUHJpY2VzIg0KYXV0aG9yOiAiRHVuY2FuIE1jS2lubm9uIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KVGhpcyBwcm9qZWN0IGFpbXMgdG8gZXhwbG9yZSBkYXRhIG9uIGhvdXNpbmcgcHJpY2VzIGluIGFuIGVmZm9ydCB0byBkZXNpZ24gYSBtb2RlbCBmb3IgZXN0aW1hdGluZyBwcmljaW5nIGFuZCBvdGhlciBiZWhhdmlvcnMgb2YgdGhlIGhvdXNpbmcgbWFya2V0LCBzcGVjaWZpY2FsbHkgd2l0aGluIEtpbmcgQ291bnR5LiANCg0KDQpgYGB7cn0NCg0Kc3VwcHJlc3NQYWNrYWdlU3RhcnR1cE1lc3NhZ2VzDQooDQogIHsNCiAgICAjaW5zdGFsbCBvciBsb2FkIHJlcXVpcmVkIHBhY2thZ2VzIGludG8gdGhlIHdvcmtzcGFjZQ0KICAgIHBhY2thZ2VfbGlzdCA8LSBjKCd0aWR5dmVyc2UnLCAna25pdHInLCAncGxvdGx5JywnUkNvbG9yQnJld2VyJywnc3FsZGYnKQ0KICAgIG5vbl9pbnN0YWxsZWQgPC0gcGFja2FnZV9saXN0WyEocGFja2FnZV9saXN0ICVpbiUgaW5zdGFsbGVkLnBhY2thZ2VzKClbLCJQYWNrYWdlIl0pXQ0KICAgIGlmKGxlbmd0aChub25faW5zdGFsbGVkKSkgaW5zdGFsbC5wYWNrYWdlcyhub25faW5zdGFsbGVkKQ0KICAgICNsaWJyYXJ5KCd0aWR5dmVyc2UnKQ0KICAgIGxpYnJhcnkoJ3Bsb3RseScpDQogICAgbGlicmFyeSgnUkNvbG9yQnJld2VyJykNCiAgICBsaWJyYXJ5KCdzcWxkZicpDQogICAgbGlicmFyeSgna25pdHInKQ0KICB9DQopDQprbml0cjo6b3B0c19jaHVuayRzZXQobWVzc2FnZSA9IEYsIHdhcm5pbmcgPSBGLCBzdHJpcC53aGl0ZSA9IEYsIHRpZHkgPSBUKQ0KYGBgDQpsb2FkIGluIHRoZSByZXF1aXJlZCBwYWNrYWdlcw0KDQoNCmBgYHtyfQ0KI2xvYWQgZGF0YXNldCBmcm9tIHByb2plY3QgZm9sZGVyDQpzdXBwcmVzc1dhcm5pbmdzKHtob3VzaW5nX2RhdGEgPC0gcmVhZC5jc3YoImtjX2hvdXNlX2RhdGEuY3N2Iiwgc3RyaW5nc0FzRmFjdG9ycyA9IEYpfSkNCg0KDQojcmVtb3ZlIHVudXNlZCBjb2x1bW5zIGFuZCBub3JtYWxpemUgdGhlIGRhdGENCmhvdXNpbmdfZGF0YV9yZXYgPC0gaG91c2luZ19kYXRhWywgMzoyMV0NCmBgYA0KbG9hZCBpbiBkYXRhc2V0IGFuZCByZW1vdmUgdW51c2VkIGZlYXR1cmVzDQoNCg0KYGBge3J9DQojY29sbGVjdCBhIHJhbmRvbSBzYW1wbGluZyBvZiB0aGUgZGF0YXNldCBvcmRlcmVkIGJ5IHByaWNlDQp0cmFpbl92YWxzIDwtIHNhbXBsZShkaW0oaG91c2luZ19kYXRhKVsxXSwgMTAwMCkNCnRlc3RfdmFscyA8LSBzYW1wbGUoZGltKGhvdXNpbmdfZGF0YSlbMV0sIDIwMCkNCg0KI2Z1bGwgdGVzdCBzZXQ/DQojdGVzdF92YWxzIDwtICEoMTpkaW0oaG91c2luZ19kYXRhKVsxXSAlaW4lIHRyYWluX3ZhbHMpDQoNCmhvdXNpbmdfdHJhaW5fZGF0YSA8LSBhcy5kYXRhLmZyYW1lKGhvdXNpbmdfZGF0YV9yZXZbdHJhaW5fdmFscyxdKQ0KaG91c2luZ190cmFpbl9kYXRhIDwtIGhvdXNpbmdfdHJhaW5fZGF0YVtvcmRlcigtaG91c2luZ190cmFpbl9kYXRhJHByaWNlKSxdDQpob3VzaW5nX3Rlc3RfZGF0YSA8LSBhcy5kYXRhLmZyYW1lKGhvdXNpbmdfZGF0YV9yZXZbdGVzdF92YWxzLF0pDQpob3VzaW5nX3Rlc3RfZGF0YSA8LSBob3VzaW5nX3Rlc3RfZGF0YVtvcmRlcigtaG91c2luZ190ZXN0X2RhdGEkcHJpY2UpLF0NCg0Kc3FsZGYoIlNlbGVjdCBwcmljZSBBcyBQcmljaW5nLCBncmFkZSBBcyBSYXRpbmcsIGNvbmRpdGlvbiBBcyBDb25kaXRpb24sIHNxZnRfbG90IEFzIFNxX0Zvb3RhZ2UgDQogICAgICAgRnJvbSBob3VzaW5nX3RyYWluX2RhdGENCiAgICAgICBMaW1pdCAyMCIpDQoNCmBgYA0KVGhpcyBkYXRhc2V0IGlzIHZlcnkgbGFyZ2UsIHNvIGluIG9yZGVyIHRvIGdldCBhIGJldHRlciB2aXN1YWwgcmVwcmVzZW50YXRpb24gb2YgdGhlIGRpc3RyaWJ1dGlvbiBvZiB0aGUgZGF0YSwgYSBzYW1wbGUgb2YgMTAwMCByYW5kb21seSBjaG9zZW4gcm93cyBpcyB1c2VkIGluc3RlYWQgb2YgdGhlIHdob2xlIGRhdGFzZXQgKG9yZGVyZWQgYnkgcHJpY2UsIGFzIHRoaXMgaXMgdGhlIGRlcGVuZGVudCB2YXJpYWJsZSkuICBUaGUgc2FtcGxlIGluZGV4ZXMgYXJlIHNlbGVjdGVkIGF0IHJ1bnRpbWUgZXZlcnkgdGltZSB0aGUgc2NyaXB0IHJ1bnMsIHNvIHRoZSBkYXRhIGFuZCBhbmFseXNpcyBhcmUgZGlmZmVyZW50IGVhY2ggdGltZSB0aGUgc2NyaXB0IGlzIHJ1bi4gQmVjYXVzZSB0aGUgc2FtcGxpbmcgaXMgcmFuZG9tIGFuZCB0aGUgZGF0YXNldCBpcyBob21vZ2Vub3VzIGFuZCBzdGF0aWMsIHRoZSAxMDAwIHJvd3MgY2hvc2VuIHNob3VsZCBhbHdheXMgYmUgcmVwcmVzZW50YXRpdmUgb2YgdGhlIG5hdHVyZSBvZiB0aGUgd2hvbGUgZGF0YXNldC4NCg0KDQpgYGB7cn0NCiNMb29rIGF0IHRoZSBkaXN0cmlidXRpb25zOg0KcDEgPC0gcGxvdF9seSgpICU+JSBhZGRfbWFya2VycyhuYW1lID0gIlByaWNpbmciLCB4ID0gMToxMDAwLCB5ID0gaG91c2luZ190cmFpbl9kYXRhJHByaWNlKSAlPiUgbGF5b3V0KHRpdGxlID0gIktpbmcgQ291bnR5IEhvdXNpbmcgRmVhdHVyZSBEaXN0cmlidXRpb25zIikNCnAyIDwtIHBsb3RfbHkoKSAlPiUgYWRkX21hcmtlcnMobmFtZSA9ICJSYXRpbmcgdiBQcmljaW5nIiwgeCA9IGhvdXNpbmdfdHJhaW5fZGF0YSRncmFkZSwgeSA9IGhvdXNpbmdfdHJhaW5fZGF0YSRwcmljZSkNCnAzIDwtIHBsb3RfbHkoKSAlPiUgYWRkX21hcmtlcnMobmFtZSA9ICJDb25kaXRpb24gdiBQcmljaW5nIiwgeCA9IGhvdXNpbmdfdHJhaW5fZGF0YSRjb25kaXRpb24sIHkgPSBob3VzaW5nX3RyYWluX2RhdGEkcHJpY2UpDQpwNCA8LSBwbG90X2x5KCkgJT4lIGFkZF9tYXJrZXJzKG5hbWUgPSAiU3EuIEZvb3RhZ2UgdiBQcmljaW5nIiwgeCA9IGhvdXNpbmdfdHJhaW5fZGF0YSRzcWZ0X2xvdCwgeSA9IGhvdXNpbmdfdHJhaW5fZGF0YSRwcmljZSkNCg0Kc3VicGxvdChwMSwgcDIsIHAzLCBwNCwgbnJvd3MgPSAyKQ0KYGBgDQpMb29raW5nIGF0IHRoZSBkaXN0cmlidXRpb25zIG9mIHByaWNlcyBieSBtYWduaXR1ZGUsIGl0IGFwcGVhcnMgdG8gZm9sbG93IGEgcG93ZXIgbGF3IGRpc3RyaWJ1dGlvbiB0aGF0IGlzbid0IG1pcnJvcmVkIGluIHRoZSBkaXN0cmlidXRpb25zIG9mIHRoZSBvdGhlciBmZWF0dXJlcyBzZWVuIGhlcmUuICBBc3N1bWluZyB0aGF0IHRoZSBmZWF0dXJlcyBpbiB0aGlzIGRhdGFzZXQgY2FwdHVyZSBtb3N0IG9mIHRoZSBmYWN0b3JzIHRoYXQgaW5mbHVlbmNlIHRoZSBwcmljaW5nIG9mIGEgaG9tZSwgc29tZSBsaW5lYXIgY29tYmluYXRpb24gb2YgdGhlIGZlYXR1cmVzIGNvdWxkIGdpdmUgYSByZWFzb25hYmxlIGVzdGltYXRlIGZvciB0aGUgcHJpY2luZyBvZiBhbiBpbmRpdmlkdWFsIGhvdXNlLiAgU2luY2UgdGhlIGRpc3RyaWJ1dGlvbiBpcyBjbG9zZXIgdG8gYSBwb3dlciBsYXcsIHR3byBzZXBlcmF0ZSBsaW5lYXIgZnVuY3Rpb25zIHNob3VsZCBnaXZlIGEgcHJldHR5IGdvb2QgZXN0aW1hdGUgYnkgYWNjb3VudGluZyBmb3IgdGhlICdrbmVlJyBpbiB0aGUgcHJpY2luZyBkaXN0cmlidXRpb24uICBGaXJzdCB0aG91Z2gsIHdlJ2xsIHRyeSBmaXR0aW5nIGEgc2luZ2xlIGxpbmUgdG8gdGhlIGRhdGEgYW5kIHNlZSB3aGF0IGtpbmQgb2YgZXJyb3IgY29tZXMgb3V0Lg0KDQpgYGB7ciBtZXNzYWdlID0gRiwgd2FybmluZyA9IEZ9DQojc3BsaXQgZGF0YSBmb3IgdHJhaW5pbmcNCiN0ZXN0IGdlbmVyYWwgbGluZWFyIG1vZGVsIG9uIHJlbGV2YW50IHF1YW50aXRhdGl2ZSBkYXRhDQoNCmhvdXNpbmdfdHJhaW5fcHJpY2UgPC0gaG91c2luZ190cmFpbl9kYXRhWywgInByaWNlIl0NCmhvdXNpbmdfdGVzdF9wcmljZSA8LSBob3VzaW5nX3Rlc3RfZGF0YVssICJwcmljZSJdDQoNCg0KZ2xtX21vZCA8LSBnbG0ocHJpY2UgfiBiZWRyb29tcyArIGJhdGhyb29tcyArIHNxZnRfbGl2aW5nICsgc3FmdF9sb3QgKyBjb25kaXRpb24gKyBncmFkZSArIHlyX2J1aWx0ICsgeXJfcmVub3ZhdGVkLCBkYXRhID0gaG91c2luZ190cmFpbl9kYXRhKQ0KDQpwcmljZV9tb2QgPC0gcHJlZGljdChnbG1fbW9kLCBuZXdkYXRhID0gaG91c2luZ190ZXN0X2RhdGEpDQoNCnByaWNlX2VzdF9wbG90bHkgPC0gcGxvdF9seSgpICU+JSANCiAgYWRkX21hcmtlcnMobmFtZSA9ICJUZXN0IFNldCBQcmljZSIsIHggPSAxOmxlbmd0aChob3VzaW5nX3Rlc3RfcHJpY2UpLCB5ID0gaG91c2luZ190ZXN0X3ByaWNlKSAlPiUNCiAgYWRkX21hcmtlcnMobmFtZSA9ICJMaW5lYXIgTW9kZWwgZXN0IiwgeCA9IDE6bGVuZ3RoKGhvdXNpbmdfdGVzdF9wcmljZSksIHkgPSBwcmljZV9tb2QpICU+JQ0KICBhZGRfbGluZXMobmFtZSA9ICJNYWduaXR1ZGUgRGlmZmVyZW5jZSIsIHggPSAxOmxlbmd0aChob3VzaW5nX3Rlc3RfcHJpY2UpLCB5ID0gcHJpY2VfbW9kIC0gaG91c2luZ190ZXN0X3ByaWNlKSAlPiUNCiAgbGF5b3V0KHRpdGxlID0gIkF0dGVtcHRpbmcgTGluZWFyIFJlZ3Jlc3Npb24gTW9kZWwgPC9icj4gDQogICAgICAgICBmb3IgUHJpY2luZyBIb21lcyBpbiBLaW5nIENvdW50eSA8L2JyPiIpDQoNCnByaWNlX2VzdF9wbG90bHkNCmBgYA0KR2l2ZW4gdGhhdCB0aGUgZGlzdHJpYnV0aW9uIG9mIHByaWNlcyB3YXMgYWxyZWFkeSBrbm93biB0byBmb2xsb3cgYSBwb3dlciBsYXcgZGlzdHJpYnV0aW9uLCB0aGUgcmVzdWx0IG9mIHRoaXMgbGluZWFyIGVzdGltYXRlIGlzbid0IHZlcnkgc3VycHJpc2luZy4gIFRoZSBsaW5lYXIgZml0IGlzIGdvb2QgZm9yIHRoZSBsb25nIHRhaWwsIGJ1dCB0aGUgd2hvbGUgaXMgY29ycnVwdGVkIGJ5IHRyeWluZyB0byBmaXQgdGhlIGNvbGxlY3Rpb24gb2YgdmFsdWVzIGFib3ZlIHRoZSBlbGJvdy4NCg0KYGBge3J9DQoNCg0KDQpgYGANCg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg0KDQoNCg==